We created an open-source Python package, 'lobpredictrst', to predict mid-price movements of Apple (AAPL) stock from limit order book (LOB) data.
Describe what the project is about and roughly our approach in each of the following 5 sections
The data used in this project come from the limit order book (LOB) and the message book for Apple stock. The LOB contains 10 levels of ask and bid prices and volumes. The data are quite clean: there are no missing values or outliers.
Add the time and limitations (only one morning, split chronologically, does not allow for seasonality effects)
Mention that we experimented with the MOB data but chose not to add the features in the end. It was sparser than originally thought
We used the same features as Kercheval and Zhang in their study. The first 40 columns are the original ask/bid prices and volumes after renaming. The next four feature sets are time-insensitive: bid-ask spreads and mid-prices, price differences, mean prices and volumes, and accumulated differences. The last four are time-sensitive: price and volume derivatives, average intensity of each type, relative intensity indicators, and accelerations (market/limit).
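As an illustration of the time-insensitive sets, the following minimal sketch computes per-level spreads, per-level mid-prices, and mean prices for a single LOB snapshot. The function name, argument names, and the three-level example prices are assumptions made for illustration only; they are not the column names or API of 'lobpredictrst'.

```python
def time_insensitive_features(ask_prices, bid_prices):
    """Compute a few time-insensitive features for one LOB snapshot.

    ask_prices / bid_prices hold the prices for levels 1..n, best level
    first. These argument names are hypothetical.
    """
    n = len(ask_prices)
    # Bid-ask spread and mid-price at each level of the book.
    spreads = [a - b for a, b in zip(ask_prices, bid_prices)]
    mid_prices = [(a + b) / 2 for a, b in zip(ask_prices, bid_prices)]
    return {
        "spreads": spreads,
        "mid_prices": mid_prices,
        # Mean price across the n levels on each side of the book.
        "mean_ask": sum(ask_prices) / n,
        "mean_bid": sum(bid_prices) / n,
    }


# Made-up three-level snapshot (the real data has 10 levels).
snapshot = time_insensitive_features(
    ask_prices=[100.5, 101.0, 101.5],
    bid_prices=[100.0, 99.5, 99.0],
)
```

The same pattern would extend to volumes and to the accumulated differences, which sum the price and volume gaps over all levels.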
Insert the Kercheval and Zhang table (screenshot) here, with a caption acknowledging the source paper.
Insert summary statistics for key columns in the training set to give an exploratory feel for the underlying raw features.
For the time-sensitive features, the biggest problem we encountered was the choice of $\Delta t$. This choice is also tied to the labels. We would like to predict stock price movement using either mid-price movement or price spread crossing. Price spread crossing is defined as follows. (1) An upward price spread crossing occurs when the best bid price at time $t+\Delta t$ is greater than the best ask price at time $t$, that is, $P_{t+\Delta t}^{Bid}>P_{t}^{Ask}$. (2) A downward price spread crossing occurs when the best ask price at time $t+\Delta t$ is smaller than the best bid price at time $t$, that is, $P_{t+\Delta t}^{Ask}<P_{t}^{Bid}$. (3) If the bid-ask spreads at times $t$ and $t+\Delta t$ still overlap, there is no price spread crossing and the label is stationary. Compared with mid-price movement, price spread crossing is therefore much less likely to register an upward or downward move, which is a particular concern in high-frequency trading, where a large $\Delta t$ may be of little use. In our tests, even with $\Delta t$ set to 1000 rows, $92\%$ of the labels were still stationary.
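The two labelling rules can be written down directly from the definitions above. This is a minimal sketch, assuming the best ask and best bid are available as plain sequences indexed by row and that $\Delta t$ is measured in rows; the function names and the example prices (in integer cents) are hypothetical.

```python
def midprice_label(ask, bid, t, dt):
    """Label row t by the mid-price change between t and t + dt."""
    mid_now = (ask[t] + bid[t]) / 2
    mid_later = (ask[t + dt] + bid[t + dt]) / 2
    if mid_later > mid_now:
        return "up"
    if mid_later < mid_now:
        return "down"
    return "stationary"


def spread_crossing_label(ask, bid, t, dt):
    """Label row t by price spread crossing between t and t + dt."""
    if bid[t + dt] > ask[t]:   # future best bid above current best ask
        return "up"
    if ask[t + dt] < bid[t]:   # future best ask below current best bid
        return "down"
    return "stationary"        # the two spreads still overlap


# Made-up best quotes in cents for three consecutive rows.
best_ask = [1020, 1030, 1060]
best_bid = [1000, 1010, 1040]

mid_dt1 = midprice_label(best_ask, best_bid, t=0, dt=1)
sc_dt1 = spread_crossing_label(best_ask, best_bid, t=0, dt=1)
sc_dt2 = spread_crossing_label(best_ask, best_bid, t=0, dt=2)
```

In this toy example the mid-price already moves up after one row, while a spread crossing only appears after two rows, which mirrors why spread crossing labels stay stationary for longer.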
In the previous section, we explained the importance of picking a good $\Delta t$. In this section, we explain in detail how we pick it.
$\Delta t$ affects our model and strategy in at least two ways:
The tradeoff we face can be understood this way:
In high-frequency trading, any current information about a profit opportunity is valuable only for an extremely short period of time, and any profit opportunity is completely exploited within a few milliseconds. This implies:
A large $\Delta t$ has one important benefit. A very small $\Delta t$ results in an extremely high proportion of 'stationary' labels, meaning that the price measure does not change. Highly imbalanced labels make the learning task trivially easy and mean the information in the features is used less efficiently: the imbalance induces the machine learning algorithm to cheat by ignoring the features and predicting 'stationary' almost every time.
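To make the "cheating" concrete, the sketch below computes the accuracy of a classifier that ignores the features and always predicts the most frequent label. The 92% stationary share matches the figure quoted earlier for spread crossing; the even 4%/4% split between 'up' and 'down' is an assumption made purely for illustration.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Hypothetical label set: 92% stationary, the rest split evenly (assumed).
example_labels = ["stationary"] * 92 + ["up"] * 4 + ["down"] * 4
baseline_label, baseline_accuracy = majority_baseline(example_labels)
```

Any model trained on such labels has to beat this baseline before its accuracy says anything about the value of the features.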
In practice, we look at the proportion of each label category ('up', 'stationary', 'down') for different values of $\Delta t$. The plot is shown below. The proportion of 'stationary' falls quickly for mid-price labels but very slowly for bid-ask spread crossing labels. In the end, we pick $\Delta t_{MD} = 30$, because the proportion of 'stationary' falls quickly before 30 and slowly after it; 30 is not too large and leaves a small enough share of 'stationary' labels. We pick $\Delta t_{SC} = 1000$, because the proportions at 1000 are about 0.33, 0.33, 0.33, and we value this balance. We acknowledge that 1000 is probably too large. However, as a machine learning exercise, we chose to prioritise how well the algorithm works and to sacrifice some essential practical considerations.
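The scan over candidate values of $\Delta t$ can be sketched as follows for mid-price labels. The function name and the short mid-price series are made up for illustration, and $\Delta t$ is again assumed to be a number of rows.

```python
def label_proportions(mid_prices, dt):
    """Share of 'up', 'stationary' and 'down' mid-price labels for a given dt."""
    if dt < 1 or dt >= len(mid_prices):
        raise ValueError("dt must be between 1 and len(mid_prices) - 1")
    counts = {"up": 0, "stationary": 0, "down": 0}
    # Only rows that have a row dt steps ahead can be labelled.
    for t in range(len(mid_prices) - dt):
        change = mid_prices[t + dt] - mid_prices[t]
        if change > 0:
            counts["up"] += 1
        elif change < 0:
            counts["down"] += 1
        else:
            counts["stationary"] += 1
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Made-up mid-price series; real data would have many thousands of rows.
mids = [100, 100, 100, 101, 101, 100, 100, 102]
proportions_dt1 = label_proportions(mids, dt=1)
proportions_dt3 = label_proportions(mids, dt=3)
```

Calling this function over a grid of $\Delta t$ values and plotting the three shares would give a curve of the kind described above, from which a value at the "elbow" can be read off.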
In future extensions of this work, we could consider picking a smaller value such as $\Delta t_{SC} = 30$ and oversampling the up/down movements to obtain better data for modelling. This would be another way to mitigate the class imbalance problem inherent in the SC approach.
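One simple form of oversampling is to resample the minority classes with replacement until every class is as large as the largest one. The sketch below shows that idea with the standard library only; it is an illustration of the proposed extension, not something implemented in 'lobpredictrst', and the function name and toy rows are hypothetical.

```python
import random
from collections import Counter


def oversample_to_balance(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until all classes are equal in size."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    rows_by_label = {}
    for row, label in zip(rows, labels):
        rows_by_label.setdefault(label, []).append(row)
    target = max(len(group) for group in rows_by_label.values())

    balanced_rows, balanced_labels = [], []
    for label, group in rows_by_label.items():
        # Draw the missing rows with replacement from the same class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            balanced_rows.append(row)
            balanced_labels.append(label)
    return balanced_rows, balanced_labels


# Toy feature rows: four stationary, one up, one down.
toy_rows = [[1], [2], [3], [4], [5], [6]]
toy_labels = ["stationary"] * 4 + ["up", "down"]
new_rows, new_labels = oversample_to_balance(toy_rows, toy_labels)
label_counts = Counter(new_labels)
```

With a chronological split, resampling of this kind would normally be applied to the training portion only, so that the test set keeps the label distribution seen in practice.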
Delete the following old skeleton version '''
According to the requirements of the project: "1. You can place only market orders, and assume that they can be executed immediately at the best bid/ask. 2. Your position can only be long/short at most one share." These requirements mean
We construct two trading strategies based on the predictions of our best random forest models with bid-ask spread crossing labels. We call them the simple strategy and the finer strategy.
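The project constraints (market orders only, filled at the best bid/ask, and a position of at most one share long or short) can be illustrated with a small simulation. The trading rule below is a hypothetical one chosen for illustration: buy one share on an 'up' prediction, sell one share on a 'down' prediction, and close out at the last quotes. It should not be read as the simple or the finer strategy themselves.

```python
def run_one_share_rule(signals, best_ask, best_bid):
    """Simulate a one-share-at-a-time rule and return the final profit.

    Buys execute at the best ask and sells at the best bid. The position
    is kept within -1, 0, +1 shares at all times.
    """
    position, cash = 0, 0
    for signal, ask, bid in zip(signals, best_ask, best_bid):
        if signal == "up" and position < 1:
            position += 1
            cash -= ask   # market buy of one share
        elif signal == "down" and position > -1:
            position -= 1
            cash += bid   # market sell of one share
    # Close any open position at the last available quotes.
    if position > 0:
        cash += best_bid[-1]
    elif position < 0:
        cash -= best_ask[-1]
    return cash


# Made-up predictions and quotes (prices in cents).
toy_signals = ["up", "stationary", "down", "down", "stationary"]
toy_ask = [101, 102, 104, 104, 103]
toy_bid = [100, 101, 103, 103, 102]
profit = run_one_share_rule(toy_signals, toy_ask, toy_bid)
```

Because every round trip pays the bid-ask spread, a rule of this kind only profits when the predicted move is larger than the spread, which is one reason spread crossing labels are a natural basis for a strategy.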
lobpredictrst